Abstract

Many applications require comparing multimodal data with differ- 
ent structure and dimensionality that cannot be compared directly. 
Recently, there has been increasing interest in methods for learning 
and efficiently representing such multimodal similarity. In this paper, 
we present a simple algorithm for multimodal similarity-preserving 
hashing, trying to map multimodal data into the Hamming space while 
preserving the intra- and inter-modal similarities. We show that our 
method significantly outperforms the state-of-the-art method in the 
field. 



1 Introduction 

The need to model and compute similarity between some objects is central 
to many applications ranging from medical imaging to biometric security. 
In various problems in different fields we need to compare objects as
different as functions, images, geometric shapes, probability distributions,
or text
documents. Each such problem has its own notion of data similarity. 

A particularly challenging case of similarity arises in applications dealing
with multimodal data, which differ in representation, dimensionality, and
structure. Data of this kind is encountered prominently in medical imaging
(e.g., fusion of different imaging modalities like PET and CT) [5] and
multimedia retrieval (e.g., querying image databases by text keywords) [2].
Such data are as incomparable as apples and oranges under standard metrics
and require a notion of multimodal similarity.

While such multimodal similarity is difficult to model, in many cases it 
is easy to learn from examples. For instance, in Internet vision applications 
we can easily obtain multiple examples of visual objects with a binary simi- 
larity function telling whether two objects are similar or not. Learning and 
representing such similarities in a convenient way is a big challenge. 

A particular setting of the similarity representation problem is
similarity-sensitive hashing [7], which has attracted significant attention
in the computer vision and pattern recognition communities. In [3], we
extended the boosting-based similarity-sensitive hashing (SSH) method to the
multimodal setting (referred to as cross-modality SSH or CM-SSH). This is, to
the best of our knowledge, the first and only multimodal
similarity-preserving hashing algorithm in the literature.

The purpose of this paper is to develop a different, simpler, and more efficient
multimodal hashing algorithm. The rest of the paper is organized as fol- 
lows. In Section 2, we formulate the problem of multimodal hashing. In 
Section 3, we overview the CM-SSH algorithm. In Section 4, we propose our 
new method (cross-modality diff-hash or CM-DIF) and in Section 5 discuss 
its extension (multimodal kernel diff-hash or MM-kDIF) using kernelization. 
Section 6 shows some experimental results. 

2 Background 

Let $X \subset \mathbb{R}^{n}$ and $Y \subset \mathbb{R}^{n'}$ be two spaces
representing data belonging to different modalities (e.g., $X$ are images and
$Y$ are text descriptions). Note that even though we assume that the data can
be represented in Euclidean space, the similarity of the data is not
necessarily Euclidean and in general can be described by some metrics
$d_X : X \times X \to \mathbb{R}_+$ and $d_Y : Y \times Y \to \mathbb{R}_+$,
to which we refer as intra-modal dissimilarities. Furthermore, we assume that
there exists some inter-modal dissimilarity
$d_{XY} : X \times Y \to \mathbb{R}_+$ quantifying the "distance" between
points in different modalities. The ensemble of intra- and inter-modal
structures $d_X, d_Y, d_{XY}$ is not necessarily a metric in the strict sense.
In order to deal with these structures in a more convenient way, we try to
represent them in a common metric space.

The broader problem of multimodal hashing is to represent the data from
different modalities $X, Y$ in a common space $\mathbb{H}^m = \{\pm 1\}^m$ of
$m$-dimensional binary vectors with the Hamming metric
$d_{\mathbb{H}^m}(a, b) = \frac{m}{2} - \frac{1}{2}\sum_{i=1}^{m} a_i b_i$ by
means of two embeddings, $\xi : X \to \mathbb{H}^m$ and
$\eta : Y \to \mathbb{H}^m$, mapping similar points as close as possible to
each other and dissimilar points as distant as possible from each other, such
that $d_{\mathbb{H}^m} \circ (\xi \times \xi) \approx d_X$,
$d_{\mathbb{H}^m} \circ (\eta \times \eta) \approx d_Y$, and
$d_{\mathbb{H}^m} \circ (\xi \times \eta) \approx d_{XY}$. In a sense, the
embeddings act as a metric coupling, trying to construct a single metric
$d_{\mathbb{H}^m} \circ (\xi \times \eta)$ that preserves the intra- and
inter-modal similarities.
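
As a small illustration of the Hamming metric on $\{\pm 1\}^m$ used above,
the following Python/NumPy sketch evaluates it through an inner product. The
function name `hamming_pm1` and the sample vectors are hypothetical and serve
only as an example.

```python
import numpy as np

def hamming_pm1(a, b):
    """Hamming distance between two vectors in {-1,+1}^m, computed as
    m/2 - (a . b)/2 (integer arithmetic, since m - a.b is always even)."""
    a = np.asarray(a, dtype=int)
    b = np.asarray(b, dtype=int)
    m = a.shape[-1]
    return (m - a @ b) // 2

a = np.array([1, -1, 1, 1, -1])
b = np.array([1, 1, 1, -1, -1])
d = hamming_pm1(a, b)   # a and b differ in two positions, so d == 2
```

The inner-product form is what later allows the expected Hamming distance to
be rewritten in terms of correlations between the hash vectors.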

A simplified setting of the multimodal hashing problem is cross-modality
hashing, in which only the inter-modal dissimilarity $d_{XY}$ is taken into
consideration and $d_X, d_Y$ are ignored.

For simplicity, in the following discussion we assume the inter-modal
dissimilarity to be binary, $d_{XY} \in \{0, 1\}$, i.e., a pair of points can
be either similar or dissimilar. This dissimilarity is usually unknown and
hard to model; however, it should be possible to sample $d_{XY}$ on some
subset of the data $X' \subset X$, $Y' \subset Y$. This sample can be
represented as a set of similar pairs of points (positives)
$\mathcal{P} = \{(x \in X', y \in Y') : d_{XY}(x, y) = 0\}$ and a set of
dissimilar pairs of points (negatives)
$\mathcal{N} = \{(x \in X', y \in Y') : d_{XY}(x, y) = 1\}$.

The problem of cross-modality hashing thus boils down to finding two
embeddings $\xi : X \to \mathbb{H}^m$ and $\eta : Y \to \mathbb{H}^m$ such
that $d_{\mathbb{H}^m} \circ (\xi \times \eta) \approx d_{XY}$.
Alternatively, this can be expressed as having
$E\{d_{\mathbb{H}^m} \circ (\xi \times \eta) \mid \mathcal{P}\} \approx 0$
(i.e., the hash has high collision probability on the set of positives) and
$E\{d_{\mathbb{H}^m} \circ (\xi \times \eta) \mid \mathcal{N}\} \gg 0$. The
former can be interpreted as the false negative rate (FNR) and the latter as
the false positive rate (FPR).

3 Cross-modality similarity-sensitive hashing 
(CM-SSH) 

To further simplify the problem, consider embeddings given in parametric
form as $\xi(x) = \mathrm{sign}(Px + a)$ and $\eta(y) = \mathrm{sign}(Qy + b)$
[7]. Here, $P, Q$ are projection matrices of size $m \times n$ and
$m \times n'$, respectively, and $a, b$ are threshold vectors of size
$m \times 1$.
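
The parametric embeddings can be sketched in a few lines of Python/NumPy. This
is only an illustration under stated assumptions: the helper `hash_codes`, the
random matrices, and the convention of mapping a zero projection to +1 are
hypothetical choices made for the example.

```python
import numpy as np

def hash_codes(Z, W, t):
    """Apply sign(W z + t) to every row z of Z; returns codes in {-1,+1}^m."""
    proj = Z @ W.T + t                      # shape (N, m)
    return np.where(proj >= 0, 1, -1)       # zero mapped to +1 (assumed convention)

rng = np.random.default_rng(0)
n, n_prime, m = 6, 4, 3
P, a = rng.standard_normal((m, n)), rng.standard_normal(m)
Q, b = rng.standard_normal((m, n_prime)), rng.standard_normal(m)
X = rng.standard_normal((5, n))             # five points of the first modality
Y = rng.standard_normal((5, n_prime))       # five points of the second modality

xi, eta = hash_codes(X, P, a), hash_codes(Y, Q, b)
# Cross-modal Hamming distances between every x and every y.
cross_hamming = (m - xi @ eta.T) // 2
```

Although the two modalities have different dimensionality, both are mapped to
codes of the same length m, which is what makes them directly comparable.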

In [3], we introduced the cross-modality similarity-sensitive hashing
(CM-SSH) method, which is, to the best of our knowledge, the first and only
multimodal hashing algorithm existing to date. The idea closely follows the
similarity-sensitive hashing (SSH) method [7], considering the hash
construction as boosted binary classification, where each hash dimension acts
as a weak binary classifier. For each dimension, AdaBoost is used to maximize
the following objective,

$$\max \sum_{(x,y) \in \mathcal{P} \cup \mathcal{N}} w_i(x, y)\, s(x, y)\, \mathrm{sign}(p_i^T x + a_i)\, \mathrm{sign}(q_i^T y + b_i), \tag{1}$$

where $s(x, y) = 1 - 2 d_{XY}(x, y)$ is the binary inter-modal similarity and
$w_i(x, y)$ is the AdaBoost weight for the pair $(x, y)$ at the $i$th
iteration. Since the optimization problem (1) is difficult, it is relaxed in
the following way [3]: First, removing the non-linearity and setting
$a_i = b_i = 0$, find the projection vectors $p_i, q_i$. Then, fixing the
projections $p_i, q_i$, find the thresholds $a_i, b_i$.

The disadvantages of the boosting-based CM-SSH are, first, its high
computational complexity and, second, its tendency to find unnecessarily long
hashes (the second problem can be partially resolved by using sequential
probability testing [1], which creates hashes of minimum expected length).



4 Cross-modality diff-hash (CM-DIF) 

In [8], we proposed a different and simpler approach (dubbed diff-hash) to 
create similarity-sensitive hash functions in the unimodal setting. We adopt 
similar ideas here to develop multimodal similarity-sensitive hashing algo- 
rithms. 

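The optimal cross-modality hashing can be found by minimizing the loss

$$\begin{aligned}
L &= \gamma\, E\{d_{\mathbb{H}^m} \circ (\xi \times \eta) \mid \mathcal{P}\} - E\{d_{\mathbb{H}^m} \circ (\xi \times \eta) \mid \mathcal{N}\} \\
  &= \frac{(\gamma - 1)m}{2} + \frac{1}{2}\left( E\{\xi^T \eta \mid \mathcal{N}\} - \gamma\, E\{\xi^T \eta \mid \mathcal{P}\} \right)
\end{aligned} \tag{2}$$

with respect to the embedding functions $\xi, \eta$, which is, up to
constants, equivalent to minimizing the correlations

$$L(P, Q, a, b) = E\{\mathrm{sign}(Px + a)^T \mathrm{sign}(Qy + b) \mid \mathcal{N}\} - \gamma\, E\{\mathrm{sign}(Px + a)^T \mathrm{sign}(Qy + b) \mid \mathcal{P}\} \tag{3}$$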
with respect to the projection matrices $P, Q$ and threshold vectors $a, b$.
The first and second terms in (3) can be thought of as FPR and FNR,
respectively. The parameter $\gamma$ controls the tradeoff between FPR and
FNR. The limit case $\gamma \gg 1$ effectively considers only the positive
pairs, ignoring the negative set.
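
To make the loss (3) concrete, the sketch below evaluates its empirical
counterpart on a toy example, replacing expectations by sample means over the
positive and negative pairs. The function names `sgn` and `cm_loss`, the toy
data, and the convention sign(0) = +1 are assumptions made for this
illustration only.

```python
import numpy as np

def sgn(v):
    # sign with zero mapped to +1 (an assumed convention)
    return np.where(v >= 0, 1.0, -1.0)

def cm_loss(P, Q, a, b, Xp, Yp, Xn, Yn, gamma):
    """Empirical version of (3): mean code correlation on the negatives minus
    gamma times the mean code correlation on the positives.
    Rows of (Xp, Yp) are positive pairs, rows of (Xn, Yn) negative pairs."""
    def corr(X, Y):
        # mean over pairs of sign(Px + a)^T sign(Qy + b)
        return np.mean(np.sum(sgn(X @ P.T + a) * sgn(Y @ Q.T + b), axis=1))
    return corr(Xn, Yn) - gamma * corr(Xp, Yp)

rng = np.random.default_rng(0)
m = 3
Xp = rng.standard_normal((100, m)); Yp = Xp.copy()   # toy positives: y = x
Xn = rng.standard_normal((100, m)); Yn = -Xn         # toy negatives: y = -x
I, z = np.eye(m), np.zeros(m)
loss = cm_loss(I, I, z, z, Xp, Yp, Xn, Yn, gamma=2.0)
# Positives agree in all m bits and negatives in none: loss = -m - gamma*m = -9.
```

In this idealized case the loss attains its lowest possible value,
$-m(1 + \gamma)$, since every bit is correct on every pair.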
Problem (3) is a highly non-convex, non-linear optimization problem that is
difficult to solve directly. Similarly to [3], we simplify the problem in the
following way. First, we ignore the thresholds and solve a simplified problem
without the sign non-linearity for the projection matrices,

$$\min_{P, Q}\; E\{(Px)^T (Qy) \mid \mathcal{N}\} - \gamma\, E\{(Px)^T (Qy) \mid \mathcal{P}\} \quad \text{s.t.} \quad P P^T = I_m,\; Q Q^T = I_m.$$

Second, fixing the projections, we find the optimal thresholds,

$$\min_{a, b}\; E\{\mathrm{sign}(Px + a)^T \mathrm{sign}(Qy + b) \mid \mathcal{N}\} - \gamma\, E\{\mathrm{sign}(Px + a)^T \mathrm{sign}(Qy + b) \mid \mathcal{P}\}.$$

We detail each step in Sections 4.1 and 4.2. The whole method is summarized
in Algorithm 1.

4.1 Projection computation 

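Dropping the sign function and the offset, the loss function (3) becomes

$$\begin{aligned}
L(P, Q, a, b) \approx L(P, Q) &= E\{(Px)^T (Qy) \mid \mathcal{N}\} - \gamma\, E\{(Px)^T (Qy) \mid \mathcal{P}\} \\
&= \mathrm{tr}\left(P\, E\{x y^T \mid \mathcal{N}\}\, Q^T\right) - \gamma\, \mathrm{tr}\left(P\, E\{x y^T \mid \mathcal{P}\}\, Q^T\right) \\
&= \mathrm{tr}\left(P (\Sigma_{XY}^{\mathcal{N}} - \gamma \Sigma_{XY}^{\mathcal{P}}) Q^T\right) = \mathrm{tr}\left(P\, \Sigma_{XY}^{D}\, Q^T\right),
\end{aligned} \tag{4}$$

where $\Sigma_{XY}^{\mathcal{P}}, \Sigma_{XY}^{\mathcal{N}}$ denote the
$n \times n'$ covariance matrices of the positive and negative multimodal
data, respectively, and $\Sigma_{XY}^{D}$ is the weighted difference of these
covariances. The name of the algorithm, cross-modality diff-hash (CM-DIF),
refers in fact to this covariance difference matrix. Note that in order to
avoid the trivial solution, we must constrain the projection matrices to have
orthonormal rows, i.e., $P P^T = I_m$ and $Q Q^T = I_m$.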

The difference of covariance matrices has a singular value decomposition of
the form $\Sigma_{XY}^{D} = U S V^T$, where $U$ and $V$ are unitary matrices
of singular vectors of size $n \times n$ and $n' \times n'$, respectively
($U^T U = I_n$, $V^T V = I_{n'}$), and $S$ is a diagonal matrix of singular
values of size $n \times n'$.

It can be easily shown that the loss $L(P, Q)$ is minimized by setting the
projection matrices to be the smallest left and right singular vectors of the
matrix $\Sigma_{XY}^{D}$, respectively: $P = [u_{n-m+1} \ldots u_n]^T$ and
$Q = [v_{n'-m+1} \ldots v_{n'}]^T$. From this result it also follows that the
problem is separable, and each dimension can be treated independently.
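
The projection step can be sketched in Python/NumPy as follows. This is an
illustration under assumptions rather than a reference implementation: the
function names are hypothetical, expectations are replaced by sample means,
the economy-size SVD is used (so singular vectors are indexed up to
min(n, n')), and the projections are taken from the trailing singular vectors
exactly as stated in the text. The check at the end verifies the trace
identity (4) numerically.

```python
import numpy as np

def covariance_difference(Xp, Yp, Xn, Yn, gamma):
    """Sample estimate of E{x y^T | N} - gamma * E{x y^T | P} (an n x n' matrix).
    Rows of (Xp, Yp) are positive pairs, rows of (Xn, Yn) negative pairs."""
    sigma_p = Xp.T @ Yp / len(Xp)
    sigma_n = Xn.T @ Yn / len(Xn)
    return sigma_n - gamma * sigma_p

def trailing_singular_projections(sigma_d, m):
    """Rows of P and Q are the m trailing left/right singular vectors, following
    the text. Requires m <= min(n, n')."""
    U, s, Vt = np.linalg.svd(sigma_d, full_matrices=False)  # s is sorted descending
    return U[:, -m:].T, Vt[-m:, :], s[-m:]

rng = np.random.default_rng(1)
gamma = 2.0
Xp, Yp = rng.standard_normal((200, 5)), rng.standard_normal((200, 4))
Xn, Yn = rng.standard_normal((300, 5)), rng.standard_normal((300, 4))
S_d = covariance_difference(Xp, Yp, Xn, Yn, gamma)
P, Q, s_sel = trailing_singular_projections(S_d, m=3)

# Relaxed loss computed directly from the pairs ...
relaxed = (np.mean(np.sum((Xn @ P.T) * (Yn @ Q.T), axis=1))
           - gamma * np.mean(np.sum((Xp @ P.T) * (Yp @ Q.T), axis=1)))
# ... coincides with tr(P Sigma_D Q^T), which here equals the sum of the
# selected singular values.
trace_form = np.trace(P @ S_d @ Q.T)
```

The whole step costs one SVD of an n x n' matrix, independent of the number of
training pairs once the two covariance matrices are accumulated.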

4.2 Threshold selection

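Having the projection matrices $P, Q$ fixed, the loss function (3) can be
written as

$$\begin{aligned}
L(a, b) &= E\{\mathrm{sign}(Px + a)^T \mathrm{sign}(Qy + b) \mid \mathcal{N}\} - \gamma\, E\{\mathrm{sign}(Px + a)^T \mathrm{sign}(Qy + b) \mid \mathcal{P}\} \\
&= \sum_{i=1}^{m} E\{\mathrm{sign}(p_i^T x + a_i)\, \mathrm{sign}(q_i^T y + b_i) \mid \mathcal{N}\} - \gamma \sum_{i=1}^{m} E\{\mathrm{sign}(p_i^T x + a_i)\, \mathrm{sign}(q_i^T y + b_i) \mid \mathcal{P}\}.
\end{aligned} \tag{5}$$

The problem is separable and can be solved independently in each dimension
$i$. We express the false negative and false positive rates as functions of
the thresholds,

$$\begin{aligned}
\mathrm{FN}_i(a_i, b_i) &= \Pr(p_i^T x + a_i < 0 \mid \mathcal{P}) \cdot \Pr(q_i^T y + b_i > 0 \mid \mathcal{P}) \\
&\quad + \Pr(p_i^T x + a_i > 0 \mid \mathcal{P}) \cdot \Pr(q_i^T y + b_i < 0 \mid \mathcal{P}) \\
&= \Pr(p_i^T x < -a_i \mid \mathcal{P}) \cdot \left(1 - \Pr(q_i^T y < -b_i \mid \mathcal{P})\right) \\
&\quad + \Pr(q_i^T y < -b_i \mid \mathcal{P}) \cdot \left(1 - \Pr(p_i^T x < -a_i \mid \mathcal{P})\right)
\end{aligned} \tag{6}$$

and

$$\begin{aligned}
\mathrm{FP}_i(a_i, b_i) &= \Pr(p_i^T x + a_i < 0 \mid \mathcal{N}) \cdot \Pr(q_i^T y + b_i < 0 \mid \mathcal{N}) \\
&\quad + \Pr(p_i^T x + a_i > 0 \mid \mathcal{N}) \cdot \Pr(q_i^T y + b_i > 0 \mid \mathcal{N}) \\
&= \Pr(p_i^T x < -a_i \mid \mathcal{N}) \cdot \Pr(q_i^T y < -b_i \mid \mathcal{N}) \\
&\quad + \left(1 - \Pr(p_i^T x < -a_i \mid \mathcal{N})\right) \cdot \left(1 - \Pr(q_i^T y < -b_i \mid \mathcal{N})\right).
\end{aligned} \tag{7}$$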

The above probabilities can be estimated from histograms (cumulative
distributions) of $p_i^T x$ and $q_i^T y$ on the positive and negative sets.
Optimal thresholds

$$(a_i^*, b_i^*) = \arg\min_{a, b}\; \gamma\, \mathrm{FN}_i(a, b) + \mathrm{FP}_i(a, b) \tag{8}$$

are obtained by means of exhaustive search. To reduce the complexity of this
search, we define a set of grids on the threshold parameter space.
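
The threshold search for a single hash dimension can be sketched as below. The
function name `select_thresholds`, the uniform grid of `grid_size` candidate
cut points spanning the observed projections, and the toy data are assumptions
made for illustration; the rates are built from empirical cumulative
distributions as products of per-modality probabilities ("bits disagree" on
positives, "bits agree" on negatives).

```python
import numpy as np

def select_thresholds(px_pos, qy_pos, px_neg, qy_neg, gamma, grid_size=32):
    """Grid search for (a, b) minimising gamma*FN + FP in one hash dimension.
    Inputs are 1-D arrays of projections on the positive / negative pairs."""
    # Candidate cut points t = -a and u = -b spanning the observed projections.
    t = np.linspace(min(px_pos.min(), px_neg.min()),
                    max(px_pos.max(), px_neg.max()), grid_size)
    u = np.linspace(min(qy_pos.min(), qy_neg.min()),
                    max(qy_pos.max(), qy_neg.max()), grid_size)

    def cdf(v, cuts):
        # Empirical Pr(v < cut) for every cut point.
        return np.mean(v[:, None] < cuts[None, :], axis=0)

    Fx_p, Fy_p = cdf(px_pos, t)[:, None], cdf(qy_pos, u)[None, :]
    Fx_n, Fy_n = cdf(px_neg, t)[:, None], cdf(qy_neg, u)[None, :]
    fn = Fx_p * (1 - Fy_p) + Fy_p * (1 - Fx_p)     # bits disagree on positives
    fp = Fx_n * Fy_n + (1 - Fx_n) * (1 - Fy_n)     # bits agree on negatives
    cost = gamma * fn + fp
    i, j = np.unravel_index(np.argmin(cost), cost.shape)
    return -t[i], -u[j], cost[i, j]

# Toy projections: positives are large in both modalities, negatives differ in x.
px_pos = np.linspace(1.0, 2.0, 50);   qy_pos = np.linspace(1.0, 2.0, 50)
px_neg = np.linspace(-2.0, -1.0, 50); qy_neg = np.linspace(1.0, 2.0, 50)
a, b, cost = select_thresholds(px_pos, qy_pos, px_neg, qy_neg, gamma=10.0)
```

With `grid_size` candidates per threshold, the search evaluates
`grid_size**2` combinations per dimension, which is the cost the grid is meant
to keep under control.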

4.3 Hash function application 

Once the projections $P, Q$ and thresholds $a, b$ are computed, given new data
points $x, y$, we construct the corresponding $m$-dimensional binary hash
vectors as $\xi(x) = \mathrm{sign}(Px + a)$ and
$\eta(y) = \mathrm{sign}(Qy + b)$.

Algorithm 1: Cross-modality diff-hash algorithm (CM-DIF).

Input: Positive and negative sets $\mathcal{P}, \mathcal{N} \subset X \times Y$
of multimodal data of dimensionality $n$ and $n'$; dimensionality of the
hash $m$; tradeoff parameter $\gamma$.

Output: Optimal projection matrices $P, Q$ of size $m \times n$,
$m \times n'$; optimal offset vectors $a, b$ of size $m \times 1$.

1 Compute the $n \times n'$ covariance matrices $\Sigma_{XY}^{\mathcal{P}}$, $\Sigma_{XY}^{\mathcal{N}}$.

2 Compute the covariance difference matrix $\Sigma_{XY}^{D} = \Sigma_{XY}^{\mathcal{N}} - \gamma \Sigma_{XY}^{\mathcal{P}}$.

3 Perform singular value decomposition $\Sigma_{XY}^{D} = U S V^T$.

4 for $i = 1, \ldots, m$ do

5   Set the $i$th row of the projection matrices to be the $i$th smallest
    left and right singular vectors, $p_i^T = u_{n-i+1}^T$ and $q_i^T = v_{n'-i+1}^T$.

6   Compute the probabilities $\Pr(p_i^T x < \cdot \mid \mathcal{P})$, $\Pr(q_i^T y < \cdot \mid \mathcal{P})$,
    $\Pr(p_i^T x < \cdot \mid \mathcal{N})$, $\Pr(q_i^T y < \cdot \mid \mathcal{N})$.

7   Compute the rates $\mathrm{FP}_i(a_i, b_i)$, $\mathrm{FN}_i(a_i, b_i)$ as a function of the
    thresholds $a_i, b_i$ according to (6) and (7).

8   Compute the optimal thresholds
    $(a_i^*, b_i^*) = \arg\min_{a, b}\; \gamma\, \mathrm{FN}_i(a, b) + \mathrm{FP}_i(a, b)$.



5 Multimodal kernel diff-hash (MM-kDIF) 

An obvious disadvantage of diff-hash (and spectral methods in general)
compared to AdaBoost-based methods is that it must be dimensionality-reducing:
since we compute the projections $P$ and $Q$ as the singular vectors of a
covariance matrix of size $n \times n'$, the dimensionality of the embedding
space must satisfy $m \le \min\{n, n'\}$. In some cases, such a dimensionality
may be too low to separate the data correctly. A second disadvantage, shared
by the cross-modality hashing problem in general, is that it considers only
the inter-modal similarity $d_{XY}$, ignoring the intra-modal similarities
$d_X, d_Y$.

A standard way to cope with the first problem is the kernel trick [6], which
transforms the data into some feature space that is never dealt with
explicitly (only inner products in this space, referred to as the kernel, are
required). A kernel version of the unimodal diff-hash was described in [3].
Here, we show that the use of kernels also allows incorporating intra-modal
similarities into the problem.

Since the problem is separable (as we have seen, projection in each di- 
mension corresponds to a singular vector of the positives covariance matrix), 
we consider for simplicity one-dimensional projections. 

The whole method is summarized in Algorithm 2. Since it considers (though
implicitly) the intra-modal dissimilarities in addition to the inter-modal
dissimilarity, we refer to it as multimodal kernel diff-hash (MM-kDIF).

5.1 Projection computation 

Let $k_X : X \times X \to \mathbb{R}$ be a positive semi-definite kernel, and
let $\phi : x \mapsto k_X(\cdot, x)$. The map $\phi$ maps the data into some
feature space, which we represent here as a Hilbert space $V$ (possibly of
infinite dimension) with an inner product $\langle \cdot, \cdot \rangle_V$.
It satisfies
$k_X(x, x') = \langle k_X(\cdot, x), k_X(\cdot, x') \rangle_V = \langle \phi(x), \phi(x') \rangle_V$.
In the same way, we define the kernel $k_Y : Y \times Y \to \mathbb{R}$ and
the associated map $\psi : y \mapsto k_Y(\cdot, y)$ to some other Hilbert
space $(V', \langle \cdot, \cdot \rangle_{V'})$ for the second modality.

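The idea of kernelization is to replace the original data $X, Y$ with the
corresponding feature vectors $\phi(X), \psi(Y)$, replacing the linear
projections $p^T x$ and $q^T y$ with

$$p(x) = \sum_{i=1}^{l} \alpha_i \langle \phi(x_i), \phi(x) \rangle_V = \alpha^T [k_X(x_1, x), \ldots, k_X(x_l, x)]^T,$$

$$q(y) = \sum_{j=1}^{l'} \beta_j \langle \psi(y_j), \psi(y) \rangle_{V'} = \beta^T [k_Y(y_1, y), \ldots, k_Y(y_{l'}, y)]^T,$$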

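respectively. Here, $\alpha, \beta$ are unknown linear combination
coefficients, and $x_1, \ldots, x_l$ and $y_1, \ldots, y_{l'}$ denote some
representative points of each modality acting as respective bases of
subspaces used for the representation of data in each modality.

In this formulation, the approximate loss becomes

$$\begin{aligned}
L(\alpha, \beta) &= \frac{1}{|\mathcal{N}|} \sum_{(x,y) \in \mathcal{N}} p(x)\, q(y) - \frac{\gamma}{|\mathcal{P}|} \sum_{(x,y) \in \mathcal{P}} p(x)\, q(y) \\
&= \frac{1}{|\mathcal{N}|} \sum_{(x,y) \in \mathcal{N}} \sum_{i=1}^{l} \sum_{j=1}^{l'} \alpha_i \langle \phi(x_i), \phi(x) \rangle_V\, \beta_j \langle \psi(y_j), \psi(y) \rangle_{V'} \\
&\quad - \frac{\gamma}{|\mathcal{P}|} \sum_{(x,y) \in \mathcal{P}} \sum_{i=1}^{l} \sum_{j=1}^{l'} \alpha_i \langle \phi(x_i), \phi(x) \rangle_V\, \beta_j \langle \psi(y_j), \psi(y) \rangle_{V'} \\
&= \frac{1}{|\mathcal{N}|} \sum_{(x,y) \in \mathcal{N}} \sum_{i=1}^{l} \alpha_i k_X(x_i, x) \sum_{j=1}^{l'} \beta_j k_Y(y_j, y)
 - \frac{\gamma}{|\mathcal{P}|} \sum_{(x,y) \in \mathcal{P}} \sum_{i=1}^{l} \alpha_i k_X(x_i, x) \sum_{j=1}^{l'} \beta_j k_Y(y_j, y) \\
&= \frac{1}{|\mathcal{N}|}\, \alpha^T K_X^{\mathcal{N}} (K_Y^{\mathcal{N}})^T \beta - \frac{\gamma}{|\mathcal{P}|}\, \alpha^T K_X^{\mathcal{P}} (K_Y^{\mathcal{P}})^T \beta,
\end{aligned}$$

where $K_X^{\mathcal{P}}$ and $K_Y^{\mathcal{P}}$ denote $l \times |\mathcal{P}|$
and $l' \times |\mathcal{P}|$ matrices, and $K_X^{\mathcal{N}}$ and
$K_Y^{\mathcal{N}}$ denote $l \times |\mathcal{N}|$ and
$l' \times |\mathcal{N}|$ matrices with elements $k_X(x_i, x)$ and
$k_Y(y_j, y)$, respectively. The optimal projection coefficients
$\alpha, \beta$ minimizing $L$ are given as the largest left and right
singular vectors of the $l \times l'$ matrix
$\frac{1}{|\mathcal{N}|} K_X^{\mathcal{N}} (K_Y^{\mathcal{N}})^T - \frac{\gamma}{|\mathcal{P}|} K_X^{\mathcal{P}} (K_Y^{\mathcal{P}})^T$.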
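
The kernelized loss can be checked numerically with a short Python/NumPy
sketch. It is an illustration under assumptions: the helper names, the random
bases and pairs, and the Gaussian kernel with a diagonal weighting `w` are
hypothetical choices. The sketch builds the four kernel matrices and the
$l \times l'$ difference matrix, and verifies that the bilinear form
$\alpha^T M \beta$ reproduces the sum form of the loss.

```python
import numpy as np

def gaussian_kernel(B, Z, w):
    """k(b, z) = exp(-(b - z)^T diag(w) (b - z)) for all rows b of B and z of Z.
    Returns a matrix with one row per basis point and one column per data point."""
    diff = B[:, None, :] - Z[None, :, :]
    return np.exp(-np.einsum('ijk,k,ijk->ij', diff, w, diff))

def kernel_difference_matrix(KxP, KyP, KxN, KyN, gamma):
    """(1/|N|) KxN KyN^T - (gamma/|P|) KxP KyP^T, an l x l' matrix."""
    return KxN @ KyN.T / KxN.shape[1] - gamma * (KxP @ KyP.T) / KxP.shape[1]

rng = np.random.default_rng(0)
n, n_prime, l, l_prime, gamma = 5, 3, 4, 6, 2.0
Bx, By = rng.standard_normal((l, n)), rng.standard_normal((l_prime, n_prime))
Xp, Yp = rng.standard_normal((20, n)), rng.standard_normal((20, n_prime))
Xn, Yn = rng.standard_normal((30, n)), rng.standard_normal((30, n_prime))
wx, wy = np.ones(n), np.ones(n_prime)

KxP, KyP = gaussian_kernel(Bx, Xp, wx), gaussian_kernel(By, Yp, wy)
KxN, KyN = gaussian_kernel(Bx, Xn, wx), gaussian_kernel(By, Yn, wy)
M = kernel_difference_matrix(KxP, KyP, KxN, KyN, gamma)

# Arbitrary coefficients: p(x) = alpha^T k_X(., x), q(y) = beta^T k_Y(., y).
alpha, beta = rng.standard_normal(l), rng.standard_normal(l_prime)
direct = (np.mean((alpha @ KxN) * (beta @ KyN))
          - gamma * np.mean((alpha @ KxP) * (beta @ KyP)))
bilinear = alpha @ M @ beta
```

Only kernel evaluations between basis points and training points are needed;
the feature maps themselves never appear.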

The kernels $k_X, k_Y$ can be selected so as to incorporate the intra-modal
similarities, which are not accounted for in the previously discussed
cross-modality hashing problem. For example, a classical choice is the
Gaussian kernel, $k_X(x, x') = e^{-d_X^2(x, x')}$ and
$k_Y(y, y') = e^{-d_Y^2(y, y')}$. This way, we account both for the
inter-modal similarity $d_{XY}$ (through the definition of the positive set
$\mathcal{P}$) and the intra-modal similarities $d_X, d_Y$ (through the
definition of the kernels $k_X, k_Y$). Furthermore, the dimensionality of the
hash is now bounded by the number of basis vectors, $m \le \min\{l, l'\}$,
which can be arbitrary and is in practice limited only by the training set
size and computational complexity. Finally, the use of kernels generalizes
the embeddings to a more generic, rather than affine, form.
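
5.2 Threshold selection

As previously, the thresholds should be selected to minimize the false
positive and negative rates for each dimension of the projection,

$$\begin{aligned}
\mathrm{FN}(a, b) &= \Pr(p(x) < -a \mid \mathcal{P}) \cdot \left(1 - \Pr(q(y) < -b \mid \mathcal{P})\right) \\
&\quad + \Pr(q(y) < -b \mid \mathcal{P}) \cdot \left(1 - \Pr(p(x) < -a \mid \mathcal{P})\right);
\end{aligned} \tag{9}$$

$$\begin{aligned}
\mathrm{FP}(a, b) &= \Pr(p(x) < -a \mid \mathcal{N}) \cdot \Pr(q(y) < -b \mid \mathcal{N}) \\
&\quad + \left(1 - \Pr(p(x) < -a \mid \mathcal{N})\right) \cdot \left(1 - \Pr(q(y) < -b \mid \mathcal{N})\right).
\end{aligned} \tag{10}$$

The optimal thresholds are obtained as

$$(a^*, b^*) = \arg\min_{a, b}\; \gamma\, \mathrm{FN}(a, b) + \mathrm{FP}(a, b). \tag{11}$$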

5.3 Hash function application 

Once the linear combination coefficients $A, B$ and thresholds $a, b$ are
computed, given new data points $x, y$, we construct the corresponding
$m$-dimensional binary hash vectors as
$\xi(x) = \mathrm{sign}(A (k_X(x_1, x), \ldots, k_X(x_l, x))^T + a)$ and
$\eta(y) = \mathrm{sign}(B (k_Y(y_1, y), \ldots, k_Y(y_{l'}, y))^T + b)$.
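
Applying the kernelized hash to new points only requires kernel evaluations
against the basis. The sketch below is illustrative: `kernel_hash` and `rbf`
are hypothetical names, the coefficient matrix is random rather than learned,
and a plain Gaussian kernel on Euclidean distance is assumed.

```python
import numpy as np

def rbf(B, Z):
    # Plain Gaussian kernel on squared Euclidean distance (an assumed choice).
    d2 = ((B[:, None, :] - Z[None, :, :]) ** 2).sum(axis=2)
    return np.exp(-d2)

def kernel_hash(Z, basis, coeffs, offsets, kernel=rbf):
    """sign(A k(z) + a) for every row z of Z, where
    k(z) = (k(b_1, z), ..., k(b_l, z))^T collects kernel values against the basis."""
    K = kernel(basis, Z)                                   # l x N kernel evaluations
    return np.where(coeffs @ K + offsets[:, None] >= 0, 1, -1).T   # N x m codes

rng = np.random.default_rng(0)
n, l, m = 4, 6, 3
basis_x = rng.standard_normal((l, n))                      # representative points x_1..x_l
A, a = rng.standard_normal((m, l)), 0.1 * rng.standard_normal(m)
X_new = rng.standard_normal((5, n))
codes = kernel_hash(X_new, basis_x, A, a)
```

The cost of hashing one point is $l$ kernel evaluations plus an
$m \times l$ matrix-vector product, so the basis size directly trades accuracy
against query time.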

6 Results 

To test the performance of the algorithms, we created simulated multimodal
data of dimensionality $n = 128$ and $n' = 64$. In each modality, the data was
created as follows: first, $K = 25$, $50$, or $100$ random vectors were
generated as "centers". To each "center" (128- or 64-dimensional,
respectively), i.i.d. Gaussian noise with a different standard deviation in
each dimension (varying between 3 and 6) was added. The binary inter-modal
similarity partitioned the dataset into $K$ classes. As the intra-modal
dissimilarity in each modality, we used the Mahalanobis metric with the
respective diagonal covariance matrix.
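
A data set of this kind could be simulated along the following lines. The
sketch fills in details that the description leaves open, and these are
assumptions: the centers are drawn from a scaled standard normal distribution
(the hypothetical `center_scale`), the per-dimension noise standard deviations
are drawn uniformly from [3, 6], and the number of points `N` is arbitrary.

```python
import numpy as np

def make_modality(K, dim, labels, rng, center_scale=10.0):
    """One modality: K random class centers plus per-dimension Gaussian noise."""
    centers = center_scale * rng.standard_normal((K, dim))   # assumed center distribution
    sigma = rng.uniform(3.0, 6.0, size=dim)                   # noise std per dimension
    data = centers[labels] + sigma * rng.standard_normal((len(labels), dim))
    return data, sigma

rng = np.random.default_rng(0)
K, N = 25, 1000
labels_x = rng.integers(0, K, size=N)      # class of each point in modality X
labels_y = rng.integers(0, K, size=N)      # class of each point in modality Y
X, sigma_x = make_modality(K, 128, labels_x, rng)
Y, sigma_y = make_modality(K, 64, labels_y, rng)

# Binary inter-modal similarity: a pair (x, y) is similar iff the classes match.
similar = labels_x[:, None] == labels_y[None, :]
```

Positive and negative training pairs can then be sampled from the `True` and
`False` entries of `similar`, respectively.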

Algorithm 2: Multimodal kernel diff-hash algorithm (MM-kDIF).

Input: Positive and negative sets $\mathcal{P}, \mathcal{N} \subset X \times Y$
of multimodal data of dimensionality $n$ and $n'$; dimensionality of the
hash $m$; kernels $k_X, k_Y$; bases $x_1, \ldots, x_l$ and
$y_1, \ldots, y_{l'}$.

Output: Optimal combination coefficient matrices $A, B$ of size
$m \times l$, $m \times l'$; optimal offset vectors $a, b$ of size
$m \times 1$.

1 Compute the kernel matrices $K_X^{\mathcal{P}}, K_Y^{\mathcal{P}}, K_X^{\mathcal{N}}, K_Y^{\mathcal{N}}$.

2 Perform singular value decomposition
  $\frac{1}{|\mathcal{N}|} K_X^{\mathcal{N}} (K_Y^{\mathcal{N}})^T - \frac{\gamma}{|\mathcal{P}|} K_X^{\mathcal{P}} (K_Y^{\mathcal{P}})^T = U S V^T$.

3 for $i = 1, \ldots, m$ do

4   Set the $i$th row of the coefficient matrices to be the $i$th largest left
    and right singular vectors, $\alpha_i^T = u_{l-i+1}^T$ and $\beta_i^T = v_{l'-i+1}^T$.

5   Compute the projections $p_i(x) = \alpha_i^T (k_X(x_1, x), \ldots, k_X(x_l, x))^T$,
    $q_i(y) = \beta_i^T (k_Y(y_1, y), \ldots, k_Y(y_{l'}, y))^T$.

6   Compute the probabilities $\Pr(p_i(x) < \cdot \mid \mathcal{P})$, $\Pr(q_i(y) < \cdot \mid \mathcal{P})$,
    $\Pr(p_i(x) < \cdot \mid \mathcal{N})$, $\Pr(q_i(y) < \cdot \mid \mathcal{N})$.

7   Compute the rates $\mathrm{FP}_i(a_i, b_i)$, $\mathrm{FN}_i(a_i, b_i)$ as a function of the
    thresholds $a_i, b_i$ according to (9) and (10).

8   Compute the optimal thresholds
    $(a_i^*, b_i^*) = \arg\min_{a, b}\; \gamma\, \mathrm{FN}_i(a, b) + \mathrm{FP}_i(a, b)$.

We compared boosting-based CM-SSH [3] and our CM-DIF and MM-kDIF methods.
Hashes of different dimension $m$ were used for CM-SSH and MM-kDIF; for
CM-DIF, $m$ was limited by the data dimensionality ($m \le 64$). We used
tradeoff parameter $\gamma = 10$. For MM-kDIF, we used bases of size
$l = l' = 10^3$ and Gaussian kernels of the form

$$k_X(x, x') = e^{-d_X^2(x, x')} = e^{-(x - x')^T \Sigma_X^{-1/2} (x - x')},$$

$$k_Y(y, y') = e^{-d_Y^2(y, y')} = e^{-(y - y')^T \Sigma_Y^{-1/2} (y - y')}.$$

For CM-SSH, the settings were according to [I]. 

The training set consisted of $10^4$ positive and $10^5$ negative pairs. The
training time for $m = 50$ was approximately 162, 0.62, and 28 seconds for
CM-SSH, CM-DIF, and MM-kDIF, respectively. Testing was performed on a set of
$5 \times 10^3$ pairs, using data from one modality as the query and data
from the other modality as the database. Performance was measured as mean
average precision (mAP) and equal error rate (EER). Ideal performance is
mAP = 1 and EER = 0.
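
For reference, the equal error rate of a distance-based detector can be
estimated as in the following sketch. It is one common way of computing the
EER and not necessarily the exact procedure used in the experiments; the
function name `equal_error_rate` and the decision rule "similar if the
Hamming distance is at most t" are assumptions.

```python
import numpy as np

def equal_error_rate(d_pos, d_neg):
    """EER of the rule 'declare a pair similar if its distance is <= t'.
    d_pos / d_neg are distances of truly similar / dissimilar pairs."""
    thresholds = np.unique(np.concatenate([d_pos, d_neg, [-np.inf]]))
    fnr = np.array([np.mean(d_pos > t) for t in thresholds])    # missed positives
    fpr = np.array([np.mean(d_neg <= t) for t in thresholds])   # accepted negatives
    k = np.argmin(np.abs(fnr - fpr))        # threshold where the two rates are closest
    return 0.5 * (fnr[k] + fpr[k])

separable = equal_error_rate(np.array([0, 1, 1]), np.array([5, 6, 7]))        # 0.0
chance = equal_error_rate(np.array([1, 2, 3, 4]), np.array([1, 2, 3, 4]))     # 0.5
```

Perfectly separated distances give EER = 0, while identical distance
distributions for positives and negatives give EER = 0.5.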

Figures 1 and 2 show the performance of different multimodal hashing
algorithms as a function of $m$ for datasets with a different number of
classes. For comparison, we show the performance of unimodal retrieval
(Euclidean distance). Our methods clearly outperform CM-SSH both in accuracy
and training time. Moreover, the performance of CM-SSH seems to fall
dramatically with increasing complexity of the dataset (more classes), while
our methods continue to perform well.

Figure 1: ROC curves showing the performance of different multimodal hashing
algorithms on synthetic data retrieval experiment with K = 25, 50, 100
(ordered left-to-right, top-to-bottom) classes. Hash length used is m = 25,
50, and 100, respectively (in the last case, m = 64 is used for CM-DIF). For
comparison, unimodal retrieval in each modality using Euclidean distance is
shown (dotted).

Figure 2: Performance (EER) of different multimodal hashing algorithms on
synthetic data retrieval experiment with K = 25, 50, 100 (ordered
left-to-right, top-to-bottom) classes as a function of the hash length m. For
comparison, unimodal retrieval in each modality using Euclidean distance is
shown (dotted). In the last case, the length of CM-DIF hash is limited by the
data dimensionality.